Abstract
Background: In an evolving field with rapid changes in standard-of-care therapies, second opinions and multidisciplinary review (MDR) can inform the management of patients with cancer, yet these resources are not universally available. We gathered a panel of experts in hematology, medical oncology, surgical oncology, radiation oncology, and radiology for MDR of over 400 anonymized complex cancer cases spanning various tumor types, including hematologic cancers. MDR treatment recommendations were captured. Herein, we analyze how 3 of the most commonly used large language models (LLMs) performed compared with the MDR expert panel recommendations and with one another.
Methods: We retrieved 38 complex malignant hematology cases previously adjudicated by MDR panels from the larger database of over 400 anonymized cancer cases collected and reviewed between 2020 and 2021. These cases were analyzed by 3 foundation LLMs (OpenAI's ChatGPT 4.5, Anthropic's Claude Opus 4, and Google's Gemini Ultra) using PrecisCa's proprietary prompting method. We then scored the recommendations from each system on a scale of 1-5 (5 being the highest) across 6 categories: completeness, reasoning, clarity, menu of options, recency, and relevance, as compared with the MDR panel recommendations. The maximum possible competence score was 30 points per case, with a maximum aggregate score of 1,140 across all 38 cases. Final LLM recommendations were also reviewed against current National Comprehensive Cancer Network (NCCN) guidelines for serious omissions. The reverse comparison (additional LLM options that the experts may have missed) was not performed, as many treatment recommendations have changed in the 4 years since the MDRs.
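As a worked restatement of the scoring arithmetic above (no new data; the category count, point scale, and case count are taken directly from the text):

$$\text{max per case} = 6 \text{ categories} \times 5 \text{ points} = 30, \qquad \text{max aggregate} = 30 \times 38 \text{ cases} = 1{,}140$$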
Results: Patient diagnoses included non-Hodgkin lymphoma (NHL) (n=15, 39.5%), multiple myeloma (MM) (n=14, 36.8%), Hodgkin lymphoma (n=5, 13.2%), and one case each of acute myeloid leukemia, chronic myeloid leukemia, acute lymphoblastic leukemia, and chronic lymphocytic leukemia (n=4, 10.5%). The median age of all patients was 49.5 years (range: 21–78). Aggregate/median competence scores (range) for ChatGPT 4.5, Claude Opus 4, and Gemini Ultra were 849/22.5 (15–30), 932/22.5 (15–30), and 964/22.5 (15–30), respectively. Assessing aggregate performance across diseases, the LLMs performed best on NHL cases and worst on MM cases. In NHL, menu of options was the highest-rated category, while reasoning was the worst; in MM, completeness was the best and relevance was the worst. Overall, ChatGPT 4.5 performed worse than the other LLMs regardless of disease category.
Although the concordance and competence rates of the LLMs varied somewhat, all 3 showed good concordance with the expert MDR recommendations. Discordant cases were reviewed and primarily involved minor differences that would not have significantly altered patient management.
Conclusions: This study demonstrates good concordance between 3 leading LLMs and expert MDR recommendations for common yet complex hematologic cancer clinical scenarios. LLM output may differ depending on the disease in question, and ChatGPT 4.5 was less accurate overall than the other two LLMs evaluated, albeit in a small sample. These findings suggest that LLM tools have continued to improve and may serve as valuable decision-support aids in malignant hematology practice, particularly where expert review is limited or unavailable. Careful human oversight remains essential to ensure safe and personalized cancer care.